Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.
Of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable: if a component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.
The sensors fitted across the different machines involved in energy generation collect data on various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, brake, etc.).
“ReneWind” is a company working to improve the machinery and processes involved in wind energy production using machine learning. It has collected sensor data on generator failures in wind turbines and has shared a ciphered version of that data, because data collected through sensors is confidential (the type of data collected varies between companies). The data has 40 predictors, with 20,000 observations in the training set and 5,000 in the test set.
The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:
It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
In the target variable, “1” represents “Failure” and “0” represents “No failure”.
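The cost ordering stated above suggests judging models by the maintenance cost they imply rather than by accuracy alone. The following is a minimal sketch under assumptions that are not spelled out in the text: a correctly predicted failure is taken to incur a repair cost, a missed failure a replacement cost, and a false alarm an inspection cost. The function name `maintenance_cost` and the three cost values are hypothetical placeholders that respect only the stated ordering (inspection < repair < replacement).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical unit costs: only the ordering inspection < repair < replacement
# comes from the problem statement, the numbers themselves are made up.
INSPECTION_COST = 1.0
REPAIR_COST = 3.0
REPLACEMENT_COST = 8.0

def maintenance_cost(y_true, y_pred):
    """Total cost under the assumed mapping TP -> repair, FN -> replacement, FP -> inspection."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp * REPAIR_COST + fn * REPLACEMENT_COST + fp * INSPECTION_COST

# Toy labels: 2 caught failures, 1 missed failure, 1 false alarm.
y_true = np.array([1, 1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
total = maintenance_cost(y_true, y_pred)
print(total)
```

Because a missed failure is the most expensive outcome under this assumed mapping, a metric of this kind favours models with high recall on the failure class.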
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import warnings
warnings.filterwarnings("ignore")
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier,RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, StackingClassifier
from xgboost import XGBClassifier
from sklearn import metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
train=pd.read_csv('Train.csv')
test=pd.read_csv('Test.csv')
df=train.copy()
df.head()
| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.464606 | -4.679129 | 3.101546 | 0.506130 | -0.221083 | -2.032511 | -2.910870 | 0.050714 | -1.522351 | 3.761892 | ... | 3.059700 | -1.690440 | 2.846296 | 2.235198 | 6.667486 | 0.443809 | -2.369169 | 2.950578 | -3.480324 | 0 |
| 1 | 3.365912 | 3.653381 | 0.909671 | -1.367528 | 0.332016 | 2.358938 | 0.732600 | -4.332135 | 0.565695 | -0.101080 | ... | -1.795474 | 3.032780 | -2.467514 | 1.894599 | -2.297780 | -1.731048 | 5.908837 | -0.386345 | 0.616242 | 0 |
| 2 | -3.831843 | -5.824444 | 0.634031 | -2.418815 | -1.773827 | 1.016824 | -2.098941 | -3.173204 | -2.081860 | 5.392621 | ... | -0.257101 | 0.803550 | 4.086219 | 2.292138 | 5.360850 | 0.351993 | 2.940021 | 3.839160 | -4.309402 | 0 |
| 3 | 1.618098 | 1.888342 | 7.046143 | -1.147285 | 0.083080 | -1.529780 | 0.207309 | -2.493629 | 0.344926 | 2.118578 | ... | -3.584425 | -2.577474 | 1.363769 | 0.622714 | 5.550100 | -1.526796 | 0.138853 | 3.101430 | -1.277378 | 0 |
| 4 | -0.111440 | 3.872488 | -3.758361 | -2.982897 | 3.792714 | 0.544960 | 0.205433 | 4.848994 | -1.854920 | -6.220023 | ... | 8.265896 | 6.629213 | -10.068689 | 1.222987 | -3.229763 | 1.686909 | -2.163896 | -3.644622 | 6.510338 | 0 |
5 rows × 41 columns
df.duplicated().sum()
#no duplicate rows...
0
df.describe(include='all').T
#there may be missing data in V1 and V2.
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| V1 | 19982.0 | -0.271996 | 3.441625 | -11.876451 | -2.737146 | -0.747917 | 1.840112 | 15.493002 |
| V2 | 19982.0 | 0.440430 | 3.150784 | -12.319951 | -1.640674 | 0.471536 | 2.543967 | 13.089269 |
| V3 | 20000.0 | 2.484699 | 3.388963 | -10.708139 | 0.206860 | 2.255786 | 4.566165 | 17.090919 |
| V4 | 20000.0 | -0.083152 | 3.431595 | -15.082052 | -2.347660 | -0.135241 | 2.130615 | 13.236381 |
| V5 | 20000.0 | -0.053752 | 2.104801 | -8.603361 | -1.535607 | -0.101952 | 1.340480 | 8.133797 |
| V6 | 20000.0 | -0.995443 | 2.040970 | -10.227147 | -2.347238 | -1.000515 | 0.380330 | 6.975847 |
| V7 | 20000.0 | -0.879325 | 1.761626 | -7.949681 | -2.030926 | -0.917179 | 0.223695 | 8.006091 |
| V8 | 20000.0 | -0.548195 | 3.295756 | -15.657561 | -2.642665 | -0.389085 | 1.722965 | 11.679495 |
| V9 | 20000.0 | -0.016808 | 2.160568 | -8.596313 | -1.494973 | -0.067597 | 1.409203 | 8.137580 |
| V10 | 20000.0 | -0.012998 | 2.193201 | -9.853957 | -1.411212 | 0.100973 | 1.477045 | 8.108472 |
| V11 | 20000.0 | -1.895393 | 3.124322 | -14.832058 | -3.922404 | -1.921237 | 0.118906 | 11.826433 |
| V12 | 20000.0 | 1.604825 | 2.930454 | -12.948007 | -0.396514 | 1.507841 | 3.571454 | 15.080698 |
| V13 | 20000.0 | 1.580486 | 2.874658 | -13.228247 | -0.223545 | 1.637185 | 3.459886 | 15.419616 |
| V14 | 20000.0 | -0.950632 | 1.789651 | -7.738593 | -2.170741 | -0.957163 | 0.270677 | 5.670664 |
| V15 | 20000.0 | -2.414993 | 3.354974 | -16.416606 | -4.415322 | -2.382617 | -0.359052 | 12.246455 |
| V16 | 20000.0 | -2.925225 | 4.221717 | -20.374158 | -5.634240 | -2.682705 | -0.095046 | 13.583212 |
| V17 | 20000.0 | -0.134261 | 3.345462 | -14.091184 | -2.215611 | -0.014580 | 2.068751 | 16.756432 |
| V18 | 20000.0 | 1.189347 | 2.592276 | -11.643994 | -0.403917 | 0.883398 | 2.571770 | 13.179863 |
| V19 | 20000.0 | 1.181808 | 3.396925 | -13.491784 | -1.050168 | 1.279061 | 3.493299 | 13.237742 |
| V20 | 20000.0 | 0.023608 | 3.669477 | -13.922659 | -2.432953 | 0.033415 | 2.512372 | 16.052339 |
| V21 | 20000.0 | -3.611252 | 3.567690 | -17.956231 | -5.930360 | -3.532888 | -1.265884 | 13.840473 |
| V22 | 20000.0 | 0.951835 | 1.651547 | -10.122095 | -0.118127 | 0.974687 | 2.025594 | 7.409856 |
| V23 | 20000.0 | -0.366116 | 4.031860 | -14.866128 | -3.098756 | -0.262093 | 2.451750 | 14.458734 |
| V24 | 20000.0 | 1.134389 | 3.912069 | -16.387147 | -1.468062 | 0.969048 | 3.545975 | 17.163291 |
| V25 | 20000.0 | -0.002186 | 2.016740 | -8.228266 | -1.365178 | 0.025050 | 1.397112 | 8.223389 |
| V26 | 20000.0 | 1.873785 | 3.435137 | -11.834271 | -0.337863 | 1.950531 | 4.130037 | 16.836410 |
| V27 | 20000.0 | -0.612413 | 4.368847 | -14.904939 | -3.652323 | -0.884894 | 2.189177 | 17.560404 |
| V28 | 20000.0 | -0.883218 | 1.917713 | -9.269489 | -2.171218 | -0.891073 | 0.375884 | 6.527643 |
| V29 | 20000.0 | -0.985625 | 2.684365 | -12.579469 | -2.787443 | -1.176181 | 0.629773 | 10.722055 |
| V30 | 20000.0 | -0.015534 | 3.005258 | -14.796047 | -1.867114 | 0.184346 | 2.036229 | 12.505812 |
| V31 | 20000.0 | 0.486842 | 3.461384 | -13.722760 | -1.817772 | 0.490304 | 2.730688 | 17.255090 |
| V32 | 20000.0 | 0.303799 | 5.500400 | -19.876502 | -3.420469 | 0.052073 | 3.761722 | 23.633187 |
| V33 | 20000.0 | 0.049825 | 3.575285 | -16.898353 | -2.242857 | -0.066249 | 2.255134 | 16.692486 |
| V34 | 20000.0 | -0.462702 | 3.183841 | -17.985094 | -2.136984 | -0.255008 | 1.436935 | 14.358213 |
| V35 | 20000.0 | 2.229620 | 2.937102 | -15.349803 | 0.336191 | 2.098633 | 4.064358 | 15.291065 |
| V36 | 20000.0 | 1.514809 | 3.800860 | -14.833178 | -0.943809 | 1.566526 | 3.983939 | 19.329576 |
| V37 | 20000.0 | 0.011316 | 1.788165 | -5.478350 | -1.255819 | -0.128435 | 1.175533 | 7.467006 |
| V38 | 20000.0 | -0.344025 | 3.948147 | -17.375002 | -2.987638 | -0.316849 | 2.279399 | 15.289923 |
| V39 | 20000.0 | 0.890653 | 1.753054 | -6.438880 | -0.272250 | 0.919261 | 2.057540 | 7.759877 |
| V40 | 20000.0 | -0.875630 | 3.012155 | -11.023935 | -2.940193 | -0.920806 | 1.119897 | 10.654265 |
| Target | 20000.0 | 0.055500 | 0.228959 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
df.isnull().sum()
#we'll deal with missing values as part of the later pipeline.
V1 18 V2 18 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 Target 0 dtype: int64
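One way the "later pipeline" handling of missing values might look is sketched below. It is an illustration only: median imputation and logistic regression are assumptions chosen for brevity (the notebook also imports `KNNImputer`), and `X_demo` / `y_demo` are small synthetic stand-ins rather than the real sensor data. Placing the imputer inside a `Pipeline` means the medians are learned only from whatever data the pipeline is fitted on, which avoids leaking information from validation folds or the test set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Small synthetic stand-in for the real data (all values are made up).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["V1", "V2", "V3"])
y_demo = (X_demo["V3"] > 0.5).astype(int)

# Knock out a few values in V1 and V2 to mimic the missing entries seen above.
X_demo.loc[[3, 10], "V1"] = np.nan
X_demo.loc[[7], "V2"] = np.nan

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # assumed strategy
    ("model", LogisticRegression(max_iter=1000)),   # assumed placeholder model
])
pipe.fit(X_demo, y_demo)

# The fitted imputer fills every gap with the column median learned during fit.
X_imputed = pipe.named_steps["imputer"].transform(X_demo)
print(np.isnan(X_imputed).sum())
```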
df['Target'].value_counts()
0    18890
1     1110
Name: Target, dtype: int64
(df['Target'].value_counts()[1])/(df['Target'].value_counts()[0])
0.058761249338274216
(test['Target'].value_counts()[1])/(test['Target'].value_counts()[0])
0.05977108944467995
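The two ratios above are failures divided by non-failures, which is slightly different from the share of failures among all rows. The sketch below rebuilds a target column from the training counts reported above (18890 zeros and 1110 ones) to show both quantities, and then shows how a stratified split keeps the failure share nearly identical in each part. The `test_size` and `random_state` values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild a target column with the class counts reported for the training set.
target = pd.Series(
    np.r_[np.zeros(18890, dtype=int), np.ones(1110, dtype=int)], name="Target"
)

failure_share = target.mean()                      # failures / all rows
failure_odds = target.sum() / (target == 0).sum()  # failures / non-failures (the ratio computed above)

# stratify=target keeps the failure share almost identical in both splits.
y_train, y_val = train_test_split(
    target, test_size=0.25, random_state=1, stratify=target
)
print(round(failure_share, 4), round(failure_odds, 4))
print(round(y_train.mean(), 4), round(y_val.mean(), 4))
```

With only about one failure in every eighteen rows, an unstratified split or plain k-fold could leave some folds with noticeably fewer failures than others, so `StratifiedKFold` (already imported) is the natural counterpart for cross-validation.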
test.describe().T
# Checking the test set, including Target, for missing values.
# Only V1 (4995 values) and V2 (4994 values) fall short of 5000, so V1 and V2 are the only columns that need imputing in the test set.
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| V1 | 4995.0 | -0.277622 | 3.466280 | -12.381696 | -2.743691 | -0.764767 | 1.831313 | 13.504352 |
| V2 | 4994.0 | 0.397928 | 3.139562 | -10.716179 | -1.649211 | 0.427369 | 2.444486 | 14.079073 |
| V3 | 5000.0 | 2.551787 | 3.326607 | -9.237940 | 0.314931 | 2.260428 | 4.587000 | 15.314503 |
| V4 | 5000.0 | -0.048943 | 3.413937 | -14.682446 | -2.292694 | -0.145753 | 2.166468 | 12.140157 |
| V5 | 5000.0 | -0.080120 | 2.110870 | -7.711569 | -1.615238 | -0.131890 | 1.341197 | 7.672835 |
| V6 | 5000.0 | -1.042138 | 2.005444 | -8.924196 | -2.368853 | -1.048571 | 0.307555 | 5.067685 |
| V7 | 5000.0 | -0.907922 | 1.769017 | -8.124230 | -2.054259 | -0.939695 | 0.212228 | 7.616182 |
| V8 | 5000.0 | -0.574592 | 3.331911 | -12.252731 | -2.642088 | -0.357943 | 1.712896 | 10.414722 |
| V9 | 5000.0 | 0.030121 | 2.174139 | -6.785495 | -1.455712 | -0.079891 | 1.449548 | 8.850720 |
| V10 | 5000.0 | 0.018524 | 2.145437 | -8.170956 | -1.353320 | 0.166292 | 1.511248 | 6.598728 |
| V11 | 5000.0 | -2.008615 | 3.112220 | -13.151753 | -4.050432 | -2.043122 | 0.044069 | 9.956400 |
| V12 | 5000.0 | 1.576413 | 2.907401 | -8.164048 | -0.449674 | 1.488253 | 3.562626 | 12.983644 |
| V13 | 5000.0 | 1.622456 | 2.882892 | -11.548209 | -0.126012 | 1.718649 | 3.464604 | 12.620041 |
| V14 | 5000.0 | -0.921097 | 1.803470 | -7.813929 | -2.110952 | -0.896011 | 0.272324 | 5.734112 |
| V15 | 5000.0 | -2.452174 | 3.387041 | -15.285768 | -4.479072 | -2.417131 | -0.432943 | 11.673420 |
| V16 | 5000.0 | -3.018503 | 4.264407 | -20.985779 | -5.648343 | -2.773763 | -0.178105 | 13.975843 |
| V17 | 5000.0 | -0.103721 | 3.336513 | -13.418281 | -2.227683 | 0.047462 | 2.111907 | 19.776592 |
| V18 | 5000.0 | 1.195606 | 2.586403 | -12.214016 | -0.408850 | 0.881395 | 2.604014 | 13.642235 |
| V19 | 5000.0 | 1.210490 | 3.384662 | -14.169635 | -1.026394 | 1.295864 | 3.526278 | 12.427997 |
| V20 | 5000.0 | 0.138429 | 3.657171 | -13.719620 | -2.325454 | 0.193386 | 2.539550 | 13.870565 |
| V21 | 5000.0 | -3.664398 | 3.577841 | -16.340707 | -5.944369 | -3.662870 | -1.329645 | 11.046925 |
| V22 | 5000.0 | 0.961960 | 1.640414 | -6.740239 | -0.047728 | 0.986020 | 2.029321 | 7.505291 |
| V23 | 5000.0 | -0.422182 | 4.056714 | -14.422274 | -3.162690 | -0.279222 | 2.425911 | 13.180887 |
| V24 | 5000.0 | 1.088841 | 3.968207 | -12.315545 | -1.623203 | 0.912815 | 3.537195 | 17.806035 |
| V25 | 5000.0 | 0.061235 | 2.010227 | -6.770139 | -1.298377 | 0.076703 | 1.428491 | 6.556937 |
| V26 | 5000.0 | 1.847261 | 3.400330 | -11.414019 | -0.242470 | 1.917032 | 4.156106 | 17.528193 |
| V27 | 5000.0 | -0.552397 | 4.402947 | -13.177038 | -3.662591 | -0.871982 | 2.247257 | 17.290161 |
| V28 | 5000.0 | -0.867678 | 1.926181 | -7.933388 | -2.159811 | -0.930695 | 0.420587 | 7.415659 |
| V29 | 5000.0 | -1.095805 | 2.655454 | -9.987800 | -2.861373 | -1.340547 | 0.521843 | 14.039466 |
| V30 | 5000.0 | -0.118699 | 3.023292 | -12.438434 | -1.996743 | 0.112463 | 1.946450 | 10.314976 |
| V31 | 5000.0 | 0.468810 | 3.446324 | -11.263271 | -1.822421 | 0.485742 | 2.779008 | 12.558928 |
| V32 | 5000.0 | 0.232567 | 5.585628 | -17.244168 | -3.556267 | -0.076694 | 3.751857 | 26.539391 |
| V33 | 5000.0 | -0.080115 | 3.538624 | -14.903781 | -2.348121 | -0.159713 | 2.099160 | 13.323517 |
| V34 | 5000.0 | -0.392663 | 3.166101 | -14.699725 | -2.009604 | -0.171745 | 1.465402 | 12.146302 |
| V35 | 5000.0 | 2.211205 | 2.948426 | -12.260591 | 0.321818 | 2.111750 | 4.031639 | 13.489237 |
| V36 | 5000.0 | 1.594845 | 3.774970 | -12.735567 | -0.866066 | 1.702964 | 4.104409 | 17.116122 |
| V37 | 5000.0 | 0.022931 | 1.785320 | -5.079070 | -1.240526 | -0.110415 | 1.237522 | 6.809938 |
| V38 | 5000.0 | -0.405659 | 3.968936 | -15.334533 | -2.984480 | -0.381162 | 2.287998 | 13.064950 |
| V39 | 5000.0 | 0.938800 | 1.716502 | -5.451050 | -0.208024 | 0.959152 | 2.130769 | 7.182237 |
| V40 | 5000.0 | -0.932406 | 2.978193 | -10.076234 | -2.986587 | -1.002764 | 1.079738 | 8.698460 |
| Target | 5000.0 | 0.056400 | 0.230716 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
def histobox(data, feature, figsize=(15, 5), kde=False, bins=100):
    """Draw a boxplot above a histogram of `feature`, sharing one x-axis."""
    # bins = number of bars (intervals) in the histogram
    HBP, (box, hist) = plt.subplots(
        nrows=2,
        sharex=True,  # both plots use the same x-axis scale; tick labels appear only on the bottom plot
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    sns.boxplot(
        data=data, x=feature, ax=box, showmeans=True, color="orange"
    )
    sns.histplot(
        data=data, x=feature, kde=kde, ax=hist, bins=bins, palette="mako"  # i like mako
    )
    # green dashed line = mean, black solid line = median
    hist.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )
    hist.axvline(
        data[feature].median(), color="black", linestyle="-"
    )

for feature in df.columns:
    histobox(df, feature, figsize=(12, 7), kde=False, bins=2000)
sns.heatmap(pd.DataFrame(df.corr()['Target']).T,annot=True,fmt='.3f',cmap='Reds')
plt.gcf().set_size_inches(60,1)
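A one-row heatmap across 40 features can be hard to scan, so a sorted list of correlations may be easier to read. The sketch below ranks features by the absolute value of their correlation with `Target`. It runs on a small synthetic frame (`demo`, with made-up values in which `V2` is deliberately tied to the target); on the real data the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: four noise features, with the target driven mostly by V2.
rng = np.random.default_rng(42)
demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=["V1", "V2", "V3", "V4"])
demo["Target"] = (demo["V2"] + 0.3 * rng.normal(size=500) > 1.0).astype(int)

# Correlation of every feature with Target, ordered by strength (sign is kept).
corr_with_target = demo.corr()["Target"].drop("Target")
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Pearson correlation only captures linear association with a binary target, so a low value here does not rule out a feature being useful to tree-based models.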
sns.pairplot(df);